
According to the World Health Organization, air pollution kills an estimated 7 million people each year. This is the combined effect of ambient (outdoor) air pollution and household air pollution. Ambient air pollution accounts for an estimated 4.2 million deaths per year from stroke, heart disease, lung cancer and chronic respiratory diseases, whilst household air pollution accounts for an estimated 3.8 million deaths per year due to exposure to smoke from cooking fires. In this exploration, we will be primarily focused on ambient air pollution. To find more information, you can visit this WHO page.
WHO data finds that 9 out of every 10 people breathe air containing a high level of pollutants. Pollutants, simply put, are harmful, unwanted substances in the air that reduce air quality. In this tutorial, we will mainly explore 4 of the 5 main types of pollutants present throughout the United States ($PM$ is not in the database).
Note: Predictions might have been more accurate if the dataset also contained particle pollution (particulate matter, or $PM$) levels, which causes similarly serious health issues.
Given how dangerous pollutants can be, we want to explore any trends regarding the presence of these pollutants in our air. We hope we can gain better insight into how air quality has changed over time and maybe find factors that contribute to higher pollutant levels.
Before we do anything, we must import the following libraries:
## import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from pandas.plotting import register_matplotlib_converters
import plotly.offline as pyo
import plotly.graph_objs as go
import plotly.express as px
import statsmodels.formula.api as sm
Now that we have all the libraries we need, we can get our dataset. We plan on using a dataset scraped from the U.S. EPA database. It contains data on four major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone) for every day from 2000-2016.
First, we must load in our dataset. We are going to load our dataset from a CSV file locally stored at the file path '../input'. You can find and download the dataset on Kaggle under US Pollution.
## read csv
df = pd.read_csv("../input/pollution_us_2000_2016.csv")
print('** successfully imported csv as dataframe **')
Oftentimes, we are presented with messy data, and we must clean the data so it becomes easier to work with. We also want to tidy the data. Tidy data is a standard way of structuring a multivariate data set so that each attribute is a column and each entity is a row. Tidy data is much easier to work with!
In this dataset, there are a large number of columns that either do not provide any valuable information, repeat information presented in other columns, or simply go unused. We delete such columns.
# drop unused columns
df = df.drop(['Unnamed: 0','State Code','County Code','Site Num','Address','NO2 Units','O3 Units','SO2 Units',
              'CO Units','NO2 1st Max Value','NO2 1st Max Hour','O3 1st Max Value','O3 1st Max Hour',
              'SO2 1st Max Value','SO2 1st Max Hour','CO 1st Max Value','CO 1st Max Hour'], axis=1)
df.head()
As seen above, some entries are duplicates sharing the same observation date (the Date Local column). Since there's no specific explanation for these duplicates, we will take the mean values for each date and city.
Also, there is a lot of missing data, represented as NaN. We delete these rows (we can afford to do so here because the many duplicate data points make up for the rows we lose).
We also delete entries with 'Country Of Mexico' as their state, since we are only dealing with pollution in the U.S.
Lastly, we change Date Local from a string to a datetime.
# replace duplications with a single entry of mean values
df = df.groupby(['State','County','City','Date Local']).mean().reset_index()
# remove entries with NA
df = df.dropna(axis='rows')
# remove Mexico
df = df[df.State!='Country Of Mexico']
# Change date from string to date value
df['Date Local'] = pd.to_datetime(df['Date Local'], format='%Y-%m-%d')
The air quality level of each category ($NO_{2}$, etc.) is defined by the Air Quality Index (AQI). More detailed information on the Index can be found on the EPA Air Quality Basics page.
According to the Air Quality Index guide provided by the EPA (page 3), the highest of the AQI values for the individual categories is reported as the overall AQI value.
We create a new column to record the overall AQI.
# overall AQI
df['AQI'] = df[['NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI']].max(axis=1)
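As a quick sanity check of this row-wise max, here is a toy example (the AQI values are invented for illustration):

```python
import pandas as pd

# Toy frame with one row per day; AQI values are invented for illustration.
toy = pd.DataFrame({
    'NO2 AQI': [12, 40, 7],
    'O3 AQI':  [30, 22, 55],
    'SO2 AQI': [5, 18, 9],
    'CO AQI':  [8, 35, 11],
})

# Each day's overall AQI is its worst (highest) category AQI.
toy['AQI'] = toy[['NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI']].max(axis=1)
print(toy['AQI'].tolist())  # [30, 40, 55]
```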
We want to explore our data! There is plenty of insight to be found in the dataset, and using plots and other visualization tools helps make these insights clear. As an example, we look at the Air Quality Index of Maryland across time. First we must get the data we want using various pandas methods. Here, we use the groupby and mean methods to get the monthly average AQI for Maryland.
# looking at Maryland
maryland = df[df['State'] == 'Maryland']
# Get monthly mean AQI values (numeric_only avoids errors on the string columns in newer pandas)
maryland = maryland.groupby([maryland['Date Local'].dt.year.values,
                             maryland['Date Local'].dt.month.values]).mean(numeric_only=True)
maryland = maryland.reset_index()
# cleaning datatable by clarifying column names
maryland['Date Local'] = pd.to_datetime(dict(year=maryland['level_0'], month=maryland['level_1'], day=1))
maryland = maryland.drop(['level_0', 'level_1'], axis=1)
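To make the autogenerated level_0 and level_1 columns less mysterious, here is a minimal sketch of the same trick on synthetic dates: grouping by raw year and month arrays (rather than named columns) leaves the index levels unnamed, so reset_index() invents those names.

```python
import pandas as pd

# Synthetic data: two January readings and one February reading.
demo = pd.DataFrame({
    'Date Local': pd.to_datetime(['2000-01-05', '2000-01-20', '2000-02-10']),
    'AQI': [30.0, 50.0, 60.0],
})

# Grouping by unnamed year/month arrays yields 'level_0' and 'level_1'
# columns once the index is reset.
monthly = demo.groupby([demo['Date Local'].dt.year.values,
                        demo['Date Local'].dt.month.values]).mean(numeric_only=True).reset_index()
print(monthly.columns.tolist())  # ['level_0', 'level_1', 'AQI']
print(monthly['AQI'].tolist())   # [40.0, 60.0] -> January mean, February mean
```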
We want to plot the data, so we use the matplotlib library and its subplots() method to look at individual graphs for $NO_2$, $O_3$, $SO_2$, $CO$, and overall AQI next to each other.
# Plotting
register_matplotlib_converters() #helps with date-time conversion from pandas
plt.rcParams['figure.figsize'] = [16, 8] #adjust graph width and length
# now we plot NO2 AQI, O3 AQI, SO2 AQI, CO AQI, and overall AQI
fig, axs = plt.subplots(2, 3)
axs[0,0].plot('Date Local', 'NO2 AQI', data=maryland) # NO2
axs[0,0].set_title('NO2 AQI')
axs[0,1].plot('Date Local', 'O3 AQI', data=maryland) # O3
axs[0,1].set_title('O3 AQI')
axs[1,0].plot('Date Local', 'SO2 AQI', data=maryland) # SO2
axs[1,0].set_title('SO2 AQI')
axs[1,1].plot('Date Local', 'CO AQI', data=maryland) # CO
axs[1,1].set_title('CO AQI')
axs[0,2].plot('Date Local', 'AQI', data=maryland) # AQI
axs[0,2].set_title('AQI')
fig.delaxes(axs[1, 2]) # remove unused plot
fig.suptitle("Maryland Air Quality Metrics Across Time") # set title
We initially see here that $SO_{2}$ AQI and $NO_{2}$ AQI have generally decreased in Maryland, while $O_{3}$ AQI, $CO$ AQI, and the overall AQI have been more level. However, we notice something more interesting. We see that each AQI metric appears to follow a sinusoidal pattern, suggesting that there may be a seasonal element to AQI in Maryland. We want to explore this further.
We will find the average overall AQI for each month of the year across our entire dataset to see if this pattern persists throughout the country. We will then plot a bar graph of these means. We first need to extract the month from each data point in our dataset.
# We first capture the month of each entity
# mapping from month number to month name
month_num_to_name = {
    1: 'January',
    2: 'February',
    3: 'March',
    4: 'April',
    5: 'May',
    6: 'June',
    7: 'July',
    8: 'August',
    9: 'September',
    10: 'October',
    11: 'November',
    12: 'December'
}
# capture month number from original dataframe and apply above mapping
df_month = df.copy()
df_month['month'] = df['Date Local'].dt.month
df_month['month'] = df_month['month'].map(month_num_to_name)
# set order to categorical month value
df_month.month = pd.Categorical(df_month.month,
                                categories=['January','February','March','April','May','June','July',
                                            'August','September','October','November','December'],
                                ordered=True)
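Why bother with the ordered Categorical? Without it, pandas sorts month names alphabetically ('April' first, 'September' late), not in calendar order. A small sketch:

```python
import pandas as pd

months = pd.Series(['March', 'January', 'February', 'January'])

# Plain strings sort alphabetically...
print(months.sort_values().tolist())
# ['February', 'January', 'January', 'March']

# ...but an ordered Categorical sorts in calendar order.
cat = pd.Series(pd.Categorical(months,
                               categories=['January', 'February', 'March'],
                               ordered=True))
print(cat.sort_values().tolist())
# ['January', 'January', 'February', 'March']
```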
Now that we have extracted out the months, we can calculate the mean AQI for each month. We then plot these results.
# plotting
# get each month's mean (numeric_only avoids errors on the string columns in newer pandas)
df_month_mean = df_month.groupby(['month']).mean(numeric_only=True)
df_month_mean = df_month_mean.reset_index()
# plot
df_month_mean.plot.bar(x='month', y='AQI', title="Mean Overall AQI by Month")
plt.ylabel('AQI')
plt.show()
Sure enough, this seasonal hypothesis seems to persist throughout the dataset. The summer months all appear to have relatively poorer air quality whilst the winter months have much better air quality. To gauge how significant this finding is, we are going to use the statsmodels.formula.api library to fit a linear regression model with a month term.
res = sm.ols('AQI~month', data=df_month).fit()
print(res.summary())
Interpreting this statsmodels summary is a little difficult, but we will walk through some results to clarify. To find more information, see the statsmodels website.
The Intercept here represents the month of January. The coefficient means that the average AQI in January is 34.38, with a standard error of .108. Looking at the P>|t| column, we see that the p-value is < .05, so we can reject the null hypothesis of no relationship. Looking one row below, we see that February has a coefficient of 1.86, meaning the AQI is on average 1.86 higher in February than in January. Similarly, the AQI is on average 13.09 higher in June than in January. The p-value for each month except November is less than .05, meaning those parameters are significantly different from zero, and for them we can reject the null hypothesis of no significance.
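To see why the intercept equals January's mean and each month coefficient is an offset from it, here is a minimal sketch of the same dummy coding with plain least squares (the two-group data is invented for illustration; statsmodels performs the equivalent fit internally when it expands 'AQI~month'):

```python
import numpy as np

# Toy AQI readings: three "January" values (mean 34) followed by three
# "February" values (mean 36). Numbers are invented for illustration.
y = np.array([30.0, 34.0, 38.0, 35.0, 36.0, 37.0])

# Dummy-coded design matrix: an intercept column of ones plus a 0/1
# indicator marking the February rows.
X = np.column_stack([np.ones(6), np.array([0, 0, 0, 1, 1, 1])])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # intercept = January mean (34), coefficient = Feb mean - Jan mean (2)
```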
We find this result quite interesting; we did not expect it going in. This highlights how you can find new insights by exploring the data more closely!
We wonder whether there may be a similar pattern within each week. We suspect that air quality may be worse on weekdays (Mon-Fri) and better on weekends, since fewer people are driving. We will check whether the dataset supports this thinking. We first need to grab the day of the week for each entity in our dataset.
df_weekdays = df.copy()
# helper returning pandas' day-of-week number (Monday=0 ... Sunday=6)
def day_conv(d):
    return d.dayofweek

# get day of week using above function
df_weekdays['day_of_week'] = df_weekdays['Date Local'].apply(day_conv)
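One thing to keep in mind: pandas numbers weekdays with Monday as 0 and Sunday as 6, which matters when we later map these numbers back to names. A quick check using dates whose weekdays are known:

```python
import pandas as pd

# 2024-01-01 fell on a Monday and 2024-01-07 on a Sunday.
print(pd.Timestamp('2024-01-01').dayofweek)  # 0 (Monday)
print(pd.Timestamp('2024-01-07').dayofweek)  # 6 (Sunday)

# The same extraction can also be done without .apply():
dates = pd.Series(pd.to_datetime(['2024-01-01', '2024-01-07']))
print(dates.dt.dayofweek.tolist())  # [0, 6]
```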
We now run a linear regression model with a day-of-the-week term, as we did previously with months.
res = sm.ols('AQI~day_of_week', data=df_weekdays).fit()
print(res.summary())
As the regression results show, there doesn't seem to be any correlation between AQI levels and day of the week. This is quite a surprising finding, considering that many people work Monday-Friday jobs, so driving activity, and subsequently emissions, should theoretically be lower on weekends. Saturday and Sunday are lower on average, but only by a fraction of an AQI point, and the regression yields a p-value of .669, so there appears to be no relationship between day of the week and AQI. The coefficient is also extraordinarily small at -0.0065, which further shows how similar the days are in terms of AQI.
We now plot the mean AQI for each day of the week.
# map day-of-week number to name (pandas convention: Monday=0 ... Sunday=6)
def day_str(e):
    if e == 0:
        return "Monday"
    elif e == 1:
        return "Tuesday"
    elif e == 2:
        return "Wednesday"
    elif e == 3:
        return "Thursday"
    elif e == 4:
        return "Friday"
    elif e == 5:
        return "Saturday"
    elif e == 6:
        return "Sunday"
# get mean for each day of the week (numeric_only avoids errors on the string columns in newer pandas)
df_weekdays = df_weekdays.groupby([df_weekdays.day_of_week]).mean(numeric_only=True).reset_index()
df_weekdays["day_of_week"] = df_weekdays["day_of_week"].apply(day_str)
# set order to weekdays
df_weekdays.day_of_week = pd.Categorical(df_weekdays.day_of_week,
                                         categories=['Monday','Tuesday','Wednesday','Thursday',
                                                     'Friday','Saturday','Sunday'],
                                         ordered=True)
# plot
df_weekdays.plot.bar(x='day_of_week', y='AQI', title="Mean Overall AQI by Day of Week")
plt.ylabel('AQI')
plt.show()
As we can see in the graph, there are no clear trends across the days of the week, matching the results from our linear regression model. We conclude that our original thinking was incorrect.
Now we move onto something different. Earlier we plotted Maryland's AQI over time. However, we are also interested in how AQI has changed over time for the rest of the United States.
We first are going to look at the 6 states with the highest (worst) mean AQI from 2000-2016. Let's get a list of those states, and then plot the yearly mean AQI for each, similarly to what we did with Maryland.
# top 6 worst states by mean AQI from 2000-2016 (numeric_only avoids errors on the string columns in newer pandas)
worst_states = df.groupby(['State']).mean(numeric_only=True).reset_index().sort_values(by='AQI', ascending=False).iloc[:6].State
worst_states = worst_states.array
# plotting yearly mean of worst air quality states
df2 = df.copy()
df2 = df2.groupby([df2.State, df2['Date Local'].dt.year.values]).mean(numeric_only=True).reset_index()
df2['Date Local'] = pd.to_datetime(dict(year=df2['level_1'], month=1, day=1))
df2 = df2.drop(['level_1'], axis=1)
fig, axs = plt.subplots(2, 3)
for state in worst_states:
    axs[0,0].plot('Date Local', 'NO2 AQI', data=df2[df2.State == state], label=state) # NO2
    axs[0,0].set_title('NO2 AQI')
    axs[0,1].plot('Date Local', 'O3 AQI', data=df2[df2.State == state], label=state) # O3
    axs[0,1].set_title('O3 AQI')
    axs[1,0].plot('Date Local', 'SO2 AQI', data=df2[df2.State == state], label=state) # SO2
    axs[1,0].set_title('SO2 AQI')
    axs[1,1].plot('Date Local', 'CO AQI', data=df2[df2.State == state], label=state) # CO
    axs[1,1].set_title('CO AQI')
    axs[0,2].plot('Date Local', 'AQI', data=df2[df2.State == state], label=state) # overall AQI
    axs[0,2].set_title('AQI')
axs[1,0].legend(loc='best') # add legend
fig.delaxes(axs[1, 2]) # remove unused plot
fig.suptitle("Worst States Air Quality Metrics (Yearly Mean) Across Time") # set title
These charts give us some insight. For example, we can see that for the worst states, $SO_{2}$ AQI and $CO$ AQI have generally gone down, while $NO_{2}$ AQI, $O_{3}$ AQI, and overall AQI have - for the most part - stayed the same.
The charts give us a good idea of how different states have performed in terms of AQI over time. However, we would like to see information for all states, yet plotting all states on one chart would be too cluttered. To resolve this issue, we use the plotly.express library to show all the states at once on a map. This gives us a different visualization tool that might give us some new insights.
We plan on using the plotly.express.choropleth method, which requires the U.S. states to be in their 2-letter abbreviation form. Thus we use a dictionary taken from this github page mapping U.S. state names to their 2-letter abbreviations. We load in the dictionary saved at file location "./dictionaries/us_state_abbrev_dict.npy" below, which creates a state name to abbreviation mapping, and then from that we create a mapping in the reverse direction.
# loads dictionary that maps from full U.S. state name to two-letter abbreviation
us_state_abbrev = np.load('dictionaries/us_state_abbrev_dict.npy', allow_pickle='TRUE').item()
# creates a dictionary that maps in the reverse direction
abbrev_us_state = dict(map(reversed, us_state_abbrev.items()))
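The dict(map(reversed, ...)) idiom simply flips each (key, value) pair. A tiny sketch, using an illustrative two-entry subset of the mapping:

```python
# Illustrative subset of the state-name-to-abbreviation mapping.
state_to_abbrev = {'Maryland': 'MD', 'California': 'CA'}

# reversed() flips each ('Maryland', 'MD') item to ('MD', 'Maryland'),
# and dict() rebuilds a mapping from the flipped pairs.
abbrev_to_state = dict(map(reversed, state_to_abbrev.items()))
print(abbrev_to_state)  # {'MD': 'Maryland', 'CA': 'California'}
```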
Now we want to get the data we are looking for. Specifically, we are going to look at the yearly average overall AQI for each state, so we can visualize states' air quality over time. We use various pandas methods below to do so.
# makes copy of original dataframe
df_states = df.copy()
# gets state yearly mean AQI (numeric_only avoids errors on the string columns in newer pandas)
df_states = df_states.groupby(['State', df_states['Date Local'].dt.year.values]).mean(numeric_only=True).reset_index()
# rename column names
df_states = df_states.rename(columns={'level_1': 'year', 'AQI': 'Mean AQI'})
# get the used columns
df_states = df_states[['State','Mean AQI', 'year']]
# map the state names to the corresponding 2-letter abbreviation
df_states['State'] = df_states['State'].map(us_state_abbrev)
We also create a new column called 'AQI range' that categorizes the mean AQI into ranges. This will help us color-coordinate each state's AQI when we create our plotly map.
def AQI_range(mean_aqi):
    if mean_aqi < 30:
        return '< 30'
    elif mean_aqi < 35:
        return '30-35'
    elif mean_aqi < 40:
        return '35-40'
    elif mean_aqi < 45:
        return '40-45'
    else:
        return '> 45'
# apply above function to dataframe
df_states['AQI range'] = df_states['Mean AQI'].apply(AQI_range)
df_states.head()
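As a design note, the same binning can be written more declaratively with pd.cut. This sketch uses bin edges chosen to match the AQI_range function above (right=False makes each bin closed on the left, matching the < comparisons):

```python
import pandas as pd

means = pd.Series([22.0, 33.0, 41.0, 50.0])  # illustrative mean AQIs

# Bins mirror AQI_range: [-inf,30), [30,35), [35,40), [40,45), [45,inf)
ranges = pd.cut(means,
                bins=[-float('inf'), 30, 35, 40, 45, float('inf')],
                labels=['< 30', '30-35', '35-40', '40-45', '> 45'],
                right=False)
print(ranges.tolist())  # ['< 30', '30-35', '40-45', '> 45']
```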
Now that we have the data we want, we use the plotly.express library and its choropleth() method to visualize the data. We can play the animation to see how different states' overall AQI has varied year to year.
# Set notebook mode to work in offline
pyo.init_notebook_mode()
# set map settings
fig = px.choropleth(df_states, locations='State', locationmode="USA-states",
                    color='AQI range', scope="usa",
                    color_discrete_map={
                        '< 30': "grey",
                        '30-35': "royalblue",
                        '35-40': "purple",
                        '40-45': "orange",
                        '> 45': "red"
                    },
                    category_orders={"year": list(range(2000, 2017))}, animation_frame="year",
                    title="Heat Map of States Mean AQI 2000-2016")
# show map
fig.show()
Unfortunately, as is common in many datasets, we are missing data for different states in different years, explaining why some states are greyed out. However, we can still see some interesting trends. For example, Northern states like North Dakota, Idaho, Washington, Maine, etc. have all generally had better AQIs, as they are more typically in the < 30 AQI range. Moreover, the country as a whole seems to have an improving AQI over time. In the 2000-2005 range, it was commonplace to see mean AQIs > 45. Yet from 2006-2016, no state in the dataset had a mean AQI > 40.
# reuse the month dataframe and add a year column
df_year = df_month
df_year['year'] = df_year['Date Local'].dt.year.values
# get each year's mean (numeric_only avoids errors on the string columns in newer pandas)
df_year_mean = df_year.groupby(['year']).mean(numeric_only=True).reset_index()
# plot
df_year_mean.plot.bar(x='year', y='AQI', title="Mean Overall AQI by Year")
plt.ylabel('AQI')
plt.show()
There does appear to be a decline in overall AQI, especially in the years 2002-2009. As we did earlier with each month, let's find the significance of these results.
res = sm.ols('AQI~year', data=df_month).fit()
print(res.summary())
We see that on average the overall AQI has declined by .5511 each year from 2000-2016, and that this finding is significantly different from zero (p-value < .05). This is quite interesting. Before beginning this exploration, we thought that air quality would be decreasing over time because of global warming, but the data suggests otherwise: air quality in the U.S. is actually improving. We have now found that both year and month have a significant relationship with overall AQI. As a point of inquiry, one may want to create a linear regression with both a month and a year term and see how well that model fits the data.
We have now looked at how states' AQI has varied over time. Another thing we are interested in is how different cities' AQIs have varied over time. To do this, we want to get the longitude and latitude of each city, so we can map the locations onto a plotly.graph_objects.Scattergeo() map. For this we use OpenCageData's free geocoding API, which returns a longitude and latitude for a location name we input, such as College Park, Maryland.
We want to do this for all unique cities in our dataset. So we first grab all the unique cities.
# Gets List of unique cities in our dataset
df_unique_cities = df.copy()
df_unique_cities = df_unique_cities[['State', 'City']]
df_unique_cities = df_unique_cities.groupby(['State', 'City']).size().reset_index().rename(columns={0:'count'})
df_unique_cities['City'] = df_unique_cities.City + ', ' + df_unique_cities.State
df_unique_cities = df_unique_cities[['City']]
unique_cities = df_unique_cities.City.array
We then query the API to get the longitude and latitude of every distinct city in our dataframe.
Using the API takes a bit of time and we have a limited number of free requests, so we stored the resulting dictionary in a file located at "./dictionaries/city_to_lon_lat_dict.npy" using the np.save() method. To run the geocoding yourself, you can get a free API key from OpenCageData. We now read from the saved file.
# read in dictionary mapping city names to longitude/latitude
city_to_lon_lat = np.load('dictionaries/city_to_lon_lat_dict.npy', allow_pickle='TRUE').item()
We now need to create a dataframe that has all the unique cities along with their corresponding longitudes and latitudes. We apply the above mapping to do so.
df_cities = df.copy() # copy original df
# gets yearly mean AQI for each city
df_cities['City'] = df_cities['City'] + ', ' + df_cities['State']
df_cities = df_cities[['City', 'Date Local', 'AQI']]
df_cities = df_cities[df_cities['City'].isin(unique_cities)]
df_cities = df_cities.groupby(['City', df_cities['Date Local'].dt.year]).mean(numeric_only=True).reset_index()
df_cities = df_cities.rename(columns={'Date Local': 'year', 'AQI': 'Mean AQI'})
# applies the dictionary mapping to insert longitude and latitude into the dataframe for each city
df_cities['lon lat'] = df_cities.City.map(city_to_lon_lat)
# separate longitude and latitude into separate columns
df_cities[['lon','lat']] = pd.DataFrame(df_cities['lon lat'].tolist(), index=df_cities.index)
df_cities.head()
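The tuple-splitting step above is worth a closer look; here is a minimal sketch with a single illustrative (longitude, latitude) pair:

```python
import pandas as pd

# One city with an illustrative (longitude, latitude) tuple.
demo = pd.DataFrame({'City': ['College Park, Maryland'],
                     'lon lat': [(-76.94, 38.99)]})

# .tolist() turns the tuple column into a list of pairs, which pandas
# expands into a two-column frame that we assign back as 'lon' and 'lat'.
demo[['lon', 'lat']] = pd.DataFrame(demo['lon lat'].tolist(), index=demo.index)
print(demo['lon'].iloc[0], demo['lat'].iloc[0])  # -76.94 38.99
```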
Now that we have the data we want, we are going to use the plotly.graph_objects library and its Scattergeo() method to plot these cities and their AQI on a map. The Scattergeo() method takes in a color argument, so we create our own discrete color scale below.
def get_color(aqi):
    if aqi < 30:
        return "grey"
    elif aqi < 35:
        return "royalblue"
    elif aqi < 40:
        return "purple"
    elif aqi < 45:
        return "orange"
    else:
        return "red"

df_cities['color'] = df_cities['Mean AQI'].apply(get_color)
The Scattergeo() method also takes in a text argument which is displayed when you hover over the displayed point on the map. We create our text argument that includes City name and AQI.
df_cities['text'] = df_cities['City'] + ', AQI: ' + df_cities['Mean AQI'].apply(str)
The rest of the code required to make the map is quite complicated, so we will not go over each detail explicitly. You can read the in-line comments to get a better idea of what is going on or you can visit this plotly tutorial, which we used as inspiration.
# set year = 2000 frame
data = go.Scattergeo(
    lon = df_cities[df_cities.year == 2000]['lon'],
    lat = df_cities[df_cities.year == 2000]['lat'],
    text = df_cities[df_cities.year == 2000]['text'],
    mode = 'markers',
    marker = dict(
        size = df_cities[df_cities.year == 2000]['Mean AQI']**3/8000, # size of circle
        color = df_cities[df_cities.year == 2000]['color'],
        line_color = 'rgb(40,40,40)',
    ))
# creates a list of frames; these will be used to create a play/pause animation through the years
frames = [dict(data=[go.Scattergeo(
                lon = df_cities[df_cities.year == year]['lon'],
                lat = df_cities[df_cities.year == year]['lat'],
                text = df_cities[df_cities.year == year]['text'],
                mode = 'markers',
                marker = dict(
                    size = df_cities[df_cities.year == year]['Mean AQI']**3/8000,
                    color = df_cities[df_cities.year == year]['color'],
                    line_color = 'rgb(40,40,40)'
                )
            )],
            layout = go.Layout(title_text="Cities with Worst AQI " + str(year))
        ) for year in range(2001, 2017)]
# sets the layout of the map
layout = go.Layout(
    title = go.layout.Title(
        text = 'Cities With Worst AQI 2000-2016'
    ),
    showlegend = False,
    geo = go.layout.Geo(
        scope = 'world',
        projection = go.layout.geo.Projection(
            type = 'albers usa'
        ),
        showland = True,
        landcolor = 'rgb(217, 217, 217)',
        subunitwidth = 1,
        countrywidth = 1,
        subunitcolor = "rgb(255, 255, 255)",
        countrycolor = "rgb(255, 255, 255)",
        showsubunits = True,
        showcountries = True
    ),
)
# creates the play/pause buttons
# creates the play/pause buttons
layout.update(updatemenus=[{
    "buttons": [
        # play button
        {
            "args": [None, {"frame": {"duration": 500, "redraw": True},
                            "fromcurrent": True,
                            "transition": {"duration": 300,
                                           "easing": "quadratic-in-out"}}],
            "label": "Play",
            "method": "animate"
        },
        # pause button
        {
            "args": [[None], {"frame": {"duration": 0, "redraw": True},
                              "mode": "immediate",
                              "transition": {"duration": 0}}],
            "label": "Pause",
            "method": "animate"
        }
    ],
    # formatting
    "direction": "left",
    "pad": {"r": 10, "t": 0},
    "showactive": False,
    "type": "buttons",
    "x": 0.1,
    "xanchor": "right",
    "y": 0,
    "yanchor": "top"
}])
# finally, create the figure and plot
fig = go.Figure(data= data, layout=layout, frames=frames)
pyo.iplot(fig)
Unfortunately, not all cities appear consistently in the dataset, so it is hard to draw firm conclusions from this visualization. However, we can still see a great decrease in the number of cities with an AQI > 45 over time. This is consistent with our finding that the mean AQI in the US has decreased. Another interesting finding is the large number of California cities with poor air quality (in all years except 2016, where many cities no longer appear in the dataset). This may be partly due to wildfires in California.
We broke down the data to decipher patterns. From this we can see that there is a correlation between time, specifically the season (month) of the year, and pollution level. The hotter months of spring and summer had the highest AQI levels on average throughout 2000-2016. There are still other questions to explore: why is there a general downward trend in AQI levels over time, as seen in our observations? Are events like summer vacation and spring break affecting emission levels through increased travel? Or are location and population bigger indicators, since California seems to have high AQI levels based on the heat map? There are still many observations to be made with this dataset, possibly with the help of additional data sets such as vacation and population data, but the analysis of the pollution data set still revealed core findings such as the effect of season on AQI levels and the general decline in overall AQI in the U.S. from 2000-2016.